Refactor Seed Dataset to improve compatibility and simplify usage #1734

Open
wants to merge 17 commits into
base: master

Conversation

apokryphosx
Collaborator

@apokryphosx apokryphosx commented Mar 7, 2025

Description

This PR solves:
#1732
#1733

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and poetry.lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

initialized from HF/Pytorch/JSON/list of Dicts,
remove the need for setup call and subsequently
cleanup
@apokryphosx apokryphosx linked an issue Mar 7, 2025 that may be closed by this pull request
2 tasks
@apokryphosx apokryphosx requested a review from hallerite March 7, 2025 13:52
@hallerite hallerite added the P0 Task with high level priority label Mar 7, 2025
@hallerite hallerite added this to the Sprint 24 milestone Mar 7, 2025
@hallerite hallerite added enhancement New feature or request Refactor labels Mar 7, 2025
Collaborator

@hallerite hallerite left a comment

Thanks @apokryphosx, I left some comments.

apokryphosx and others added 12 commits March 9, 2025 10:08
instead of strings and add seed for reproducibility
between simply skipping invalid datapoints in a
seed dataset and throwing an exception
seed dataset to ensure they are defined before the
other functions are
getitem and cast len(data) to a Sized to pass mypy
tests
Collaborator

@hallerite hallerite left a comment

Thanks @apokryphosx, left some comments.

), "Dataset does not support indexing."

self._raw_data = [dict(data[i]) for i in range(len(data))]
elif isinstance(data, Path):
Collaborator

Let's add some more safety features here. The current configuration does not:

  • check if the list contains only dictionaries
  • check if the file is valid JSON before attempting to load
  • handle potential encoding issues
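
A minimal sketch of what these three checks could look like for the JSON path branch. The helper name `load_json_records` and its `encoding` parameter are hypothetical and not part of this PR; the sketch also assumes the file is expected to hold a top-level JSON array.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_json_records(path: Path, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Hypothetical helper: read a JSON file and return a list of dicts."""
    # Handle potential encoding issues before any parsing happens.
    try:
        text = path.read_text(encoding=encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(f"Could not decode {path} as {encoding}") from exc

    # Check that the file is valid JSON.
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc

    # Assumption: the top-level value must be a JSON array.
    if not isinstance(payload, list):
        raise ValueError(
            f"Expected a JSON array in {path}, got {type(payload).__name__}"
        )

    # Check that the list contains only dictionaries.
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            raise ValueError(
                f"Item {index} in {path} is {type(item).__name__}, expected dict"
            )
    return payload
```

Raising `ValueError` everywhere is just one choice for the sketch; a dedicated exception type may fit the codebase better.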

validating each against the DataPoint schema.
This class can initialize from Hugging Face Datasets,
PyTorch Datasets, JSON file paths, or lists of dictionaries,
converting them into a consistent internal format.
"""

def __init__(
Collaborator

This __init__ is getting really messy. It would be better to create specific methods to handle the conversion.
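
One possible shape for this, as a rough sketch only: `__init__` picks a converter and each input kind gets its own small method. The class name `SeedDatasetSketch` and the `_from_*` method names are hypothetical. To stay self-contained, the sketch avoids importing the Hugging Face or PyTorch dataset types and treats any object with `__len__` and `__getitem__` as indexable, which is an assumption rather than what the PR does.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class SeedDatasetSketch:
    """Illustrative only; not the class from this PR."""

    def __init__(self, data: Any) -> None:
        # __init__ only delegates; all conversion logic lives in helpers.
        self._raw_data: List[Dict[str, Any]] = self._convert(data)

    def _convert(self, data: Any) -> List[Dict[str, Any]]:
        if isinstance(data, Path):
            return self._from_json_path(data)
        if isinstance(data, list):
            return self._from_list(data)
        # Assumption: HF and PyTorch map-style datasets support len() and indexing.
        if hasattr(data, "__len__") and hasattr(data, "__getitem__"):
            return self._from_indexable(data)
        raise TypeError(f"Unsupported data type: {type(data).__name__}")

    @staticmethod
    def _from_json_path(path: Path) -> List[Dict[str, Any]]:
        with path.open(encoding="utf-8") as handle:
            loaded = json.load(handle)
        return SeedDatasetSketch._from_list(loaded)

    @staticmethod
    def _from_list(items: Any) -> List[Dict[str, Any]]:
        if not isinstance(items, list) or not all(
            isinstance(item, dict) for item in items
        ):
            raise TypeError("Expected a list containing only dictionaries.")
        return [dict(item) for item in items]

    @staticmethod
    def _from_indexable(data: Any) -> List[Dict[str, Any]]:
        # Same conversion as the existing indexable-dataset branch in the diff.
        return [dict(data[i]) for i in range(len(data))]
```

With this layout, supporting a new input type means adding one method and one branch in `_convert`, and each converter can be unit tested on its own.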

logger.debug("No raw data to process")
return

def create_datapoint(
Collaborator

This method is supposed to validate whether a dict is a valid DataPoint. Providing default values for nonexistent keys defeats the whole purpose.

Note that it should not fail if there are extra fields.
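
To illustrate the intended behavior, here is a rough sketch that uses plain dicts instead of the real DataPoint schema. The key names in `REQUIRED_KEYS` and the functions `validate_record` and `validate_all` are hypothetical placeholders. The `strict` flag mirrors the choice, mentioned in the commits, between skipping invalid datapoints and throwing an exception.

```python
from typing import Any, Dict, List

# Hypothetical required keys; the real DataPoint schema may differ.
REQUIRED_KEYS = ("question", "final_answer")


def validate_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Reject records with missing keys; never fill in defaults."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    # Extra fields are allowed and passed through unchanged.
    return dict(record)


def validate_all(
    records: List[Dict[str, Any]], strict: bool = True
) -> List[Dict[str, Any]]:
    """Validate every record; raise on the first failure or skip it."""
    valid: List[Dict[str, Any]] = []
    for index, record in enumerate(records):
        try:
            valid.append(validate_record(record))
        except ValueError as exc:
            if strict:
                raise ValueError(
                    f"Invalid datapoint at index {index}: {exc}"
                ) from exc
            # Non-strict mode: drop the invalid record and continue.
    return valid
```

A record with an extra field passes, while a record missing a required key is rejected instead of being silently patched with a default value.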

Labels
enhancement New feature or request P0 Task with high level priority Refactor
Development

Successfully merging this pull request may close these issues.

Seed Dataset has compatibility issues Seed Dataset has unexpected behavior
2 participants